This session will continue to explore the uses of
ggplot2 and other packages that can improve your data
visualization and statistical analyses.
Let’s load the required packages
packages_to_install <- c("tidyverse","gapminder","ggpubr","ggridges","ggreveal")
for (package in packages_to_install) {
if (!(package %in% rownames(installed.packages()))) {
install.packages(package, dependencies = TRUE)
print(paste("Installed package:", package))
} else {
print(paste(package, "is already installed"))
}
}
[1] "tidyverse is already installed"
[1] "gapminder is already installed"
[1] "ggpubr is already installed"
[1] "ggridges is already installed"
[1] "ggreveal is already installed"
library(tidyverse)
library(gapminder)
library(ggpubr)
library(ggridges)
library(ggreveal)
Now that we understand the building blocks that go into creating a
plot with ggplot we can make our lives easier by setting a
universal theme at the top of our script that will help keep our figures
looking consistent.
theme_set(theme_bw()) # sets black and white theme for all plots generated in the document
theme_replace( # sets individual plot elements without overwriting theme
axis.title.x = element_text(size = 14),
axis.title.y = element_text(size = 14,
angle = 90,),
axis.text.x = element_text(size = 12),
axis.text.y = element_text(size = 12),
legend.title = element_text(size = 14),
legend.text = element_text(size = 12)
)
You can still manually set all of these plot elements in the plot
code itself, but this means that every plot we make shouldn’t require
quite so much coding to create a consistent visual style.
Create histograms and violin plots to examine distribution
Let’s look at the basic scatter plot we created before using the Iris
dataset.
ggplot(data = iris,
aes(x = Petal.Length,
y = Petal.Width)) +
geom_point(aes(color = Species))

Histograms offer several benefits in data analysis and visualization.
They’re an important part of exploratory analysis of your data.
Visualizing Distribution and Symmetry: Histograms provide a
visual representation of the distribution of a dataset. They allow you
to see the shape, center, and spread of the data. By examining the
histogram, you can quickly identify if the data are skewed, symmetric,
or bimodal, which can provide insights into the underlying patterns and
characteristics of the data. Understanding the skewness of the data can
guide decisions regarding appropriate statistical methods and data
transformations.
Identifying Outliers: Histograms can help identify potential
outliers or unusual observations in the data. Outliers appear as bars
that are significantly separated from the main distribution or as
isolated bars with very low frequencies. Detecting outliers is important
for understanding data quality and assessing the impact they may have on
statistical analyses.
Data Exploration and Comparison: Histograms allow for the
exploration and comparison of multiple variables or subgroups within a
dataset. By creating separate histograms for different groups or
categories, you can visually compare their distributions and understand
the similarities or differences between them.
Let’s plot a basic histogram of Petal.Length, colored by
Species
ggplot(data = iris,
aes(x = Petal.Length)) +
geom_histogram(aes(fill = Species))

We’ll add black outlines to the bars to make them easier to see

Split the histogram into facets per species.
iris_histo +
facet_grid(cols = vars(Species)) # with facet_grid we can choose whether to wrap in rows or columns

iris_histo +
facet_grid(rows = vars(Species)) # this wraps the data in rows

Plot the Sepal length per species as a violin plot
ggplot(data = iris,
aes(x = Species,
y = Sepal.Length,
fill = Species)) +
geom_violin()

Let’s add quantile information to our plots
ggplot(data = iris,
aes(x = Species,
y = Sepal.Length,
fill = Species)) +
geom_violin(draw_quantiles = c( # draws quantile lines
0.25,0.5,0.75
))

Let’s tidy the plot up a little bit!
ggplot(data = iris,
aes(x = Species,
y = Sepal.Length,
fill = Species)) +
geom_violin(alpha = 0.5,
draw_quantiles = c(0.25,0.5,0.75)) +
labs(y = "Sepal length (mm)")

Checking the normality of data
We can check normality of data using the Shapiro-Wilk Test. The
shapiro.test function takes a single numeric vector as its
argument and returns the result of Shapiro-Wilk test. The test evaluates
the null hypothesis that the population from which the data are drawn is
normally distributed. A P-value less than the significance level
(usually 0.05) suggests that the data are not normally
distributed.
lapply(iris[,1:4], # use lapply to apply the shapiro.test function to multiple columns
function(x) shapiro.test(x)) -> normality_tests
# Print the results of normality tests. The results of the normality tests are stored in the normalityTests list, with variable names assigned using names().
names(normality_tests) <- colnames(iris[, 1:4])
print(normality_tests)
$Sepal.Length
Shapiro-Wilk normality test
data: x
W = 0.97609, p-value = 0.01018
$Sepal.Width
Shapiro-Wilk normality test
data: x
W = 0.98492, p-value = 0.1012
$Petal.Length
Shapiro-Wilk normality test
data: x
W = 0.87627, p-value = 7.412e-10
$Petal.Width
Shapiro-Wilk normality test
data: x
W = 0.90183, p-value = 1.68e-08
According to these test results, our only normally distributed data
are the Sepal.Width measurements. Let’s check this by plotting
histograms for each variable using hist(). The
par(mfrow = c(2,2)) command sets up a 2 x 2 grid layout for
the histograms, allowing them to be displayed in separate panels.
par(mfrow = c(2, 2))
hist(iris$Sepal.Length, main = "Sepal Length", xlab = "Length")
hist(iris$Sepal.Width, main = "Sepal Width", xlab = "Width")
hist(iris$Petal.Length, main = "Petal Length", xlab = "Length")
hist(iris$Petal.Width, main = "Petal Width", xlab = "Width")

A slightly more attractive (but possibly less informative) way to
visualize this data is using density ridges from
ggridges.
To use this, we’ll need to adjust the layout of the iris
dataframe to a long format.
iris %>%
pivot_longer(cols = 1:4, # select measurement columns
names_to = "measurement",# set column name for variables
values_to = "value") %>% # set column name for measurements
ggplot(aes(x = value,
y = measurement,
fill = Species)) +
geom_density_ridges(
alpha = 0.5, # set opacity of ridges
scale = 0.98 # set height scaling of ridges
) +
scale_x_continuous(name = "Measurement (cm)") +
scale_y_discrete(name = NULL,
labels = c("Petal.Length" = "Petal Length",
"Petal.Width" = "Petal Width",
"Sepal.Length" = "Sepal Length",
"Sepal.Width" = "Sepal Width"))

Other density plots
Another way to visualize your data is using density plots, also known
as kernel density plots. There are several benefits to visualizing your
data through density plots:
Visualize distribution: Density plots provide a smooth,
continuous estimate of the underlying probability density function (PDF)
of a variable. Unlike histograms, which use discrete bins, density plots
show the shape of the distribution without the binning bias. They
provide a more detailed and nuanced representation of the data
distribution.
Non-Parametric Estimation: Density plots do not assume a specific
distributional form for the data. They are non-parametric and estimate
the density function based on the observed data points. This flexibility
makes them suitable for exploring and visualizing data that may not
adhere to a specific distribution.
Comparison and Overlapping Distributions: Density plots make it
easy to compare multiple distributions or visualize the overlap of
different groups. By using different colors or line types, you can
distinguish and compare distributions visually. This is particularly
useful for identifying differences, similarities, or shifts in
distributions across different groups or variables.
Communicating Findings: Density plots provide an effective way to
communicate distributional characteristics and patterns to others. They
offer a visually appealing and intuitive representation of data
distributions, making it easier for others to grasp the underlying shape
and variability.
ggplot(data = gapminder,
aes(x = gdpPercap,
y = lifeExp)) +
geom_bin2d(
bins = c(50,50) # vectors for number of horizontal and vertical bins
) +
scale_fill_viridis_c(
name = "Number of countries"
) + # uses a viridis scale which is easier to visualize
labs(x = "GDP per capita (USD)", # set x axis title
y = "Life expectancy (years)") + # set y axis title
theme(
legend.position = "top", # set legend position
legend.key.width = unit(2, "cm"), # set legend key width
legend.title.position = "top" # move legend title
)

Adding statistics to your plots
ggplot has arguments for completing simple statistical
tests through the extension ggpubr.
ggpubr offers additional functions and utilities for
data visualization and statistical analysis, specifically designed to
work seamlessly with ggplot2. Let’s add statistics to some
box plots. Here we will compare mean life expectancy between continents
using the gapminder dataset.
Reference: http://www.sthda.com/english/articles/24-ggpubr-publication-ready-plots/76-add-p-values-and-significance-levels-to-ggplots/#google_vignette
First we need to check the normality of our data
shapiro.test(gapminder$lifeExp)
Shapiro-Wilk normality test
data: gapminder$lifeExp
W = 0.95248, p-value < 2.2e-16
Our P value is much lower than the 0.05 cutoff value which
means our data are not normally distributed, so we need to take that
into account when choosing a statistical test.
What test would you use to compare means across continents if the
data were normally distributed?

Our Kruskal-Wallis test tells us that there is a significant
difference between groups, but not between which groups. If our data
were normal and we performed an ANOVA we would perform a post-hoc test.
Instead, we can use a pairwise Wilcoxon rank sum test.
pairwise.wilcox.test(gapminder$lifeExp,
gapminder$continent,
p.adjust.method = "BH")
Pairwise comparisons using Wilcoxon rank sum test with continuity correction
data: gapminder$lifeExp and gapminder$continent
Africa Americas Asia Europe
Americas < 2e-16 - - -
Asia < 2e-16 3.5e-07 - -
Europe < 2e-16 < 2e-16 < 2e-16 -
Oceania 9.3e-16 5.7e-08 6.4e-10 0.047
P value adjustment method: BH
Now we can see that all continents are significantly different from
one another!
Preparing plots for presentation
Sometimes we have very busy plots that would benefit from being
revealed in sequence as part of storytelling in a presentation. For
that, we can use ggreveal, which allows us to reveal parts
of a plot sequentially and save each intermediate step for use. Let’s
look at the changes in life expectancy over several decades.

Now that we’ve created our life expectancy plot we can choose how to
reveal the layers.
plot_list
[[1]]
[[2]]
[[3]]
[[4]]
[[5]]
[[6]]
[[7]]







You can click through the different plots that ggreveal
has made. Now let’s save all these plots individually.
reveal_save(
plot_list, # list of plots
"life_exp.tiff"
)
Activities
Green 1
These activities will use the gapminder dataset which is
a subset of country-by-country GDP from gapminder.org from 2010 and
another of R’s built-in datasets, diamonds, which contains
data on diamond size, clarity, and color.
Perform normality tests on the gapminder dataset. Look specifically
at the normality of the data from Europe.
Green 2
Create a histogram using the diamonds data set. Facet so
that they are aligned vertically.
Blue 1
Create a 2D bin (density) plot showing GDP Per Capita and Life
Expectancy using geom_bind2d(). Then, come up with a way to
show all five continents separately
Blue 2
Now add density lines to your plot using
geom_density2d().
Black 1
Using the gapminder data, use one or more techniques to show the
distribution of Life Expectancy over time for each continent.
Hint: Explore the geoms we’ve used to day, or try a new
one! You can look up all available geoms here: https://ggplot2.tidyverse.org/reference/index.html
Black 2
Visualize specific comparisons between continents and label with
brackets and P values or significance markers.
Hint: use the reference above to work out how to do
this
---
title: "Data visualization and statistics"
author: "Yasmin Hilliam, PhD"
date: "2025-07-05"
output: html_notebook
---

This session will continue to explore the uses of `ggplot2` and other packages that can improve your data visualization and statistical analyses.

Let's load the required packages

```{r}
packages_to_install <- c("tidyverse","gapminder","ggpubr","ggridges","ggreveal")

for (package in packages_to_install) {
  if (!(package %in% rownames(installed.packages()))) {
    install.packages(package, dependencies = TRUE)
    print(paste("Installed package:", package))
  } else {
    print(paste(package, "is already installed"))
  }
}

library(tidyverse)
library(gapminder)
library(ggpubr)
library(ggridges)
library(ggreveal)
```

Now that we understand the building blocks that go into creating a plot with `ggplot` we can make our lives easier by setting a universal theme at the top of our script that will help keep our figures looking consistent.

```{r}
theme_set(theme_bw()) # sets black and white theme for all plots generated in the document

theme_replace( # sets individual plot elements without overwriting theme
  axis.title.x = element_text(size = 14),
  axis.title.y = element_text(size = 14,
                              angle = 90,),
  axis.text.x = element_text(size = 12),
  axis.text.y = element_text(size = 12),
  legend.title = element_text(size = 14),
  legend.text = element_text(size = 12)
)
```

You can still manually set all of these plot elements in the plot code itself, but this means that every plot we make shouldn't require quite so much coding to create a consistent visual style.

#### Create histograms and violin plots to examine distribution

Let's look at the basic scatter plot we created before using the Iris dataset.

```{r}
ggplot(data = iris,
       aes(x = Petal.Length, 
           y = Petal.Width)) +
  geom_point(aes(color = Species))
```

Histograms offer several benefits in data analysis and visualization. They're an important part of exploratory analysis of your data.

-   Visualizing Distribution and Symmetry: Histograms provide a visual representation of the distribution of a dataset. They allow you to see the shape, center, and spread of the data. By examining the histogram, you can quickly identify if the data are skewed, symmetric, or bimodal, which can provide insights into the underlying patterns and characteristics of the data. Understanding the skewness of the data can guide decisions regarding appropriate statistical methods and data transformations.

-   Identifying Outliers: Histograms can help identify potential outliers or unusual observations in the data. Outliers appear as bars that are significantly separated from the main distribution or as isolated bars with very low frequencies. Detecting outliers is important for understanding data quality and assessing the impact they may have on statistical analyses.

-   Data Exploration and Comparison: Histograms allow for the exploration and comparison of multiple variables or subgroups within a dataset. By creating separate histograms for different groups or categories, you can visually compare their distributions and understand the similarities or differences between them.

Let's plot a basic histogram of `Petal.Length`, colored by `Species`

```{r}
ggplot(data = iris,
       aes(x = Petal.Length)) +
  geom_histogram(aes(fill = Species))
```

We'll add black outlines to the bars to make them easier to see

```{r}
ggplot(data = iris,
       aes(x = Petal.Length)) +
  geom_histogram(
    aes(fill = Species),
    color = "black" # adds black outline to all bars
  ) -> iris_histo
```

Split the histogram into facets per species.

```{r}
iris_histo +
  facet_grid(cols = vars(Species)) # with facet_grid we can choose whether to wrap in rows or columns

iris_histo +
  facet_grid(rows = vars(Species)) # this wraps the data in rows
```

Plot the Sepal length per species as a violin plot

```{r}
ggplot(data = iris,
       aes(x = Species,
           y = Sepal.Length,
           fill = Species)) +
  geom_violin()
```

Let's add quantile information to our plots

```{r}
ggplot(data = iris,
       aes(x = Species,
           y = Sepal.Length,
           fill = Species)) +
  geom_violin(draw_quantiles = c( # draws quantile lines
    0.25,0.5,0.75
  ))
```

Let's tidy the plot up a little bit!

```{r}
ggplot(data = iris,
       aes(x = Species,
           y = Sepal.Length,
           fill = Species)) +
  geom_violin(alpha = 0.5,
              draw_quantiles = c(0.25,0.5,0.75)) +
  labs(y = "Sepal length (mm)")
```

#### Checking the normality of data

We can check normality of data using the Shapiro-Wilk Test. The `shapiro.test` function takes a single numeric vector as its argument and returns the result of Shapiro-Wilk test. The test evaluates the null hypothesis that the population from which the data are drawn is normally distributed. A P-value less than the significance level (usually 0.05) suggests that the data are **not** normally distributed.

```{r}
lapply(iris[,1:4], # use lapply to apply the shapiro.test function to multiple columns
       function(x) shapiro.test(x)) -> normality_tests

# Print the results of normality tests. The results of the normality tests are stored in the normalityTests list, with variable names assigned using names().
names(normality_tests) <- colnames(iris[, 1:4])
print(normality_tests)
```

According to these test results, our only normally distributed data are the Sepal.Width measurements. Let's check this by plotting histograms for each variable using `hist()`. The `par(mfrow = c(2,2))` command sets up a 2 x 2 grid layout for the histograms, allowing them to be displayed in separate panels.

```{r}
par(mfrow = c(2, 2))

hist(iris$Sepal.Length, main = "Sepal Length", xlab = "Length")

hist(iris$Sepal.Width, main = "Sepal Width", xlab = "Width")

hist(iris$Petal.Length, main = "Petal Length", xlab = "Length")

hist(iris$Petal.Width, main = "Petal Width", xlab = "Width")
```

A slightly more attractive (but possibly less informative) way to visualize this data is using density ridges from `ggridges`.

To use this, we'll need to adjust the layout of the `iris` dataframe to a long format.

```{r}
iris %>%
  pivot_longer(cols = 1:4, # select measurement columns
               names_to = "measurement",# set column name for variables
               values_to = "value") %>% # set column name for measurements
  ggplot(aes(x = value,
             y = measurement,
             fill = Species)) +
  geom_density_ridges(
    alpha = 0.5, # set opacity of ridges
    scale = 0.98 # set height scaling of ridges
  ) +
  scale_x_continuous(name = "Measurement (cm)") +
  scale_y_discrete(name = NULL,
                   labels = c("Petal.Length" = "Petal Length",
                                "Petal.Width" = "Petal Width",
                                "Sepal.Length" = "Sepal Length",
                                "Sepal.Width" = "Sepal Width"))
```

#### Other density plots

Another way to visualize your data is using density plots, also known as kernel density plots. There are several benefits to visualizing your data through density plots:

-   Visualize distribution: Density plots provide a smooth, continuous estimate of the underlying probability density function (PDF) of a variable. Unlike histograms, which use discrete bins, density plots show the shape of the distribution without the binning bias. They provide a more detailed and nuanced representation of the data distribution.

-   Non-Parametric Estimation: Density plots do not assume a specific distributional form for the data. They are non-parametric and estimate the density function based on the observed data points. This flexibility makes them suitable for exploring and visualizing data that may not adhere to a specific distribution.

-   Comparison and Overlapping Distributions: Density plots make it easy to compare multiple distributions or visualize the overlap of different groups. By using different colors or line types, you can distinguish and compare distributions visually. This is particularly useful for identifying differences, similarities, or shifts in distributions across different groups or variables.

-   Communicating Findings: Density plots provide an effective way to communicate distributional characteristics and patterns to others. They offer a visually appealing and intuitive representation of data distributions, making it easier for others to grasp the underlying shape and variability.

```{r}
ggplot(data = gapminder, 
       aes(x = gdpPercap, 
           y = lifeExp)) +
  geom_bin2d(
    bins = c(50,50) # vectors for number of horizontal and vertical bins
  ) +
  scale_fill_viridis_c(
    name = "Number of countries"
  ) + # uses a viridis scale which is easier to visualize
  labs(x = "GDP per capita (USD)", # set x axis title
       y = "Life expectancy (years)") + # set y axis title
  theme(
    legend.position = "top", # set legend position
    legend.key.width = unit(2, "cm"), # set legend key width
    legend.title.position = "top" # move legend title 
  )
```

#### Adding statistics to your plots

`ggplot` has arguments for completing simple statistical tests through the extension `ggpubr`.

`ggpubr` offers additional functions and utilities for data visualization and statistical analysis, specifically designed to work seamlessly with `ggplot2`. Let's add statistics to some box plots. Here we will compare mean life expectancy between continents using the `gapminder` dataset.

Reference: <http://www.sthda.com/english/articles/24-ggpubr-publication-ready-plots/76-add-p-values-and-significance-levels-to-ggplots/#google_vignette>

First we need to check the normality of our data

```{r}
shapiro.test(gapminder$lifeExp)
```

Our P value is *much* lower than the 0.05 cutoff value which means our data are not normally distributed, so we need to take that into account when choosing a statistical test.

What test would you use to compare means across continents if the data *were* normally distributed?

```{r}
ggplot(data = gapminder,
       aes(x = continent,
           y = lifeExp,
           color = continent)) +
  geom_boxplot(outliers = FALSE) +
  geom_jitter(width = 0.2,
              alpha = 0.3) +
  stat_compare_means(
    method = "kruskal.test",
    label.y = 90 # sets position of P value label
  ) +
  guides( # guides allows us to finely manipulate legends
    color = guide_legend( # we want to adjust the color legend
      override.aes = aes(label = "") # remove the legend guide for the label
    )
  ) +
  scale_color_discrete(name = "Continent") +
  labs(x = NULL,
       y = "Life expectancy (years)") +
  theme(
    legend.position = "inside", # moves legend inside plot bounds
    legend.position.inside = c( 
      0.9,0.3 # values from 0-1 that set legend hor. and ver. position
    ), 
    legend.background = element_rect( # set outline color of legend box
      color = "black"
    )
  )
```

Our Kruskal-Wallis test tells us that there is a significant difference between groups, but not between which groups. If our data were normal and we performed an ANOVA we would perform a post-hoc test. Instead, we can use a pairwise Wilcoxon rank sum test.

```{r}
pairwise.wilcox.test(gapminder$lifeExp,
                     gapminder$continent,
                     p.adjust.method = "BH")
```

Now we can see that all continents are significantly different from one another!

#### Preparing plots for presentation

Sometimes we have very busy plots that would benefit from being revealed in sequence as part of storytelling in a presentation. For that, we can use `ggreveal`, which allows us to reveal parts of a plot sequentially and save each intermediate step for use. Let's look at the changes in life expectancy over several decades.

```{r}
gapminder %>%
  filter(year %in% c("1952","1962","1972","1982","1992","2002")) %>%
  filter(
    continent != "Oceania" # removing due to too few countries for density plotting
  ) %>%
  ggplot(aes(x = lifeExp,
             y = continent,
             fill = as.factor(year))) +
  geom_density_ridges(alpha = 0.5,
                      scale = 0.9) +
  labs(x = "Life expectancy (years)",
       y = "Continent") +
  scale_fill_discrete(name = "Year") -> life_exp
```

Now that we've created our life expectancy plot we can choose how to reveal the layers.

```{r}
reveal_aes(life_exp,
           aes = "fill") -> plot_list

plot_list
```

You can click through the different plots that `ggreveal` has made. Now let's save all these plots individually.

```{r}
reveal_save(
  plot_list, # list of plots
  "life_exp.tiff"
)
```

### Activities

#### Green 1

These activities will use the `gapminder` dataset which is a subset of country-by-country GDP from gapminder.org from 2010 and another of R's built-in datasets, `diamonds`, which contains data on diamond size, clarity, and color.

Perform normality tests on the gapminder dataset. Look specifically at the normality of the data from Europe.

```{r}

```

#### Green 2

Create a histogram using the `diamonds` data set. Facet so that they are aligned vertically.

```{r}

```

#### Blue 1

Create a 2D bin (density) plot showing GDP Per Capita and Life Expectancy using `geom_bind2d()`. Then, come up with a way to show all five continents separately

```{r}

```

#### Blue 2

Now add density lines to your plot using `geom_density2d()`.

```{r}

```

#### Black 1

Using the gapminder data, use one or more techniques to show the distribution of Life Expectancy over time for each continent.

**Hint: Explore the geoms we've used to day, or try a new one!** You can look up all available geoms here: <https://ggplot2.tidyverse.org/reference/index.html>

```{r}

```

#### Black 2

Visualize specific comparisons between continents and label with brackets and P values or significance markers.

**Hint: use the reference above to work out how to do this**

```{r}

```
